Cut-and-Pick Transactions for Proxy Log Mining
Authors
Abstract
Web logs collected by proxy servers, referred to as proxy logs or proxy traces, record Web document accesses by many users against many Web sites. This “many-to-many” characteristic poses a challenge to Web log mining techniques because individual access transactions are difficult to identify: in a proxy log, user transactions are not clearly bounded and are sometimes interleaved with each other as well as with noise. Most previous work has used simplistic measures, such as a fixed time interval, to determine transaction boundaries, and has not addressed the problem of interleaved and noisy transactions. In this paper, we show that this simplistic view can lead to poor performance in building models to predict future access patterns. We present a more advanced cut-and-pick method for determining the access transactions from proxy logs, which decides on more reasonable transaction boundaries and removes noisy accesses. Our method takes advantage of the observation that, in most transactions, the same user typically visits multiple related Web sites that form clusters. Our algorithm discovers these clusters based on the connectivity among Web sites. Using real-world proxy logs, we show experimentally that the cut-and-pick method produces more accurate transactions, which in turn yield Web-access prediction models with higher accuracy.
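The abstract does not spell out the algorithm, so the following is only a rough sketch of the two ideas it contrasts, under stated assumptions. `split_by_time_gap` illustrates the fixed-time-interval baseline attributed to previous work. `site_clusters` and `pick_by_cluster` illustrate one possible reading of "clusters based on the connectivity among Web sites", namely connected components of an undirected site graph, with accesses to unclustered sites treated as noise. All function names, the 1800-second gap, the sample sites, and the link list are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def split_by_time_gap(requests, max_gap):
    """Baseline cut: start a new transaction whenever the gap between
    consecutive requests of one user exceeds max_gap seconds."""
    transactions, current, last_time = [], [], None
    for timestamp, site in sorted(requests):
        if last_time is not None and timestamp - last_time > max_gap:
            transactions.append(current)
            current = []
        current.append((timestamp, site))
        last_time = timestamp
    if current:
        transactions.append(current)
    return transactions


def site_clusters(links):
    """Assumed clustering: map each site to a label for its connected
    component in an undirected graph built from (site, site) links."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    cluster_of = {}
    for start in graph:
        if start in cluster_of:
            continue
        cluster_of[start] = start
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in graph[node]:
                if neighbour not in cluster_of:
                    cluster_of[neighbour] = start
                    stack.append(neighbour)
    return cluster_of


def pick_by_cluster(transaction, cluster_of):
    """Assumed pick step: regroup one time-based transaction by site
    cluster and drop accesses to sites that belong to no cluster."""
    groups = defaultdict(list)
    for timestamp, site in transaction:
        if site in cluster_of:
            groups[cluster_of[site]].append((timestamp, site))
    return list(groups.values())


# Hypothetical requests of one user: (seconds, site).
user_requests = [
    (0, "news.example"), (5, "shop.example"), (8, "sports.example"),
    (12, "ads.example"), (15, "pay.example"), (4000, "news.example"),
]
# Hypothetical connectivity between sites.
links = [("news.example", "sports.example"), ("shop.example", "pay.example")]

transactions = split_by_time_gap(user_requests, max_gap=1800)
picked = pick_by_cluster(transactions[0], site_clusters(links))
```

In this toy input the time-gap cut alone returns one transaction of five interleaved requests followed by a second single-request transaction. The assumed pick step then separates the first transaction into a news/sports group and a shop/pay group and discards the `ads.example` access.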
Related resources
Restoring Meaningful Episodes in a Proxy Log
Web logs collected at proxy servers, referred to as proxy logs, contain rich information about Web user activities. These logs are becoming critical data sources for various Web applications such as Web log mining. However, a raw proxy log treated as a flat sequence of individual Web requests does not reliably represent correct information about Web user behavior, owing to a lack of semantic st...
Where Have You Been? A Comparison of Three Web Tracking Technologies
“Web tracking” is the process of gathering Web access information. Such access information can then be used for improving future browsing and access, or as input for data mining processes that analyze patterns. We have implemented three Web tracking systems, and evaluated and compared their performance and characteristics. In the first system, rather than connecting directly to Web sites, a cli...
Developing a method for identification of net zones using log data and diffusivity equation
Distinguishing the productive zones of a drilled oil well plays a very important role in helping petroleum engineers decide where to perforate to produce oil. Conventionally, net pay zones are determined by applying a set of cut-offs on petrophysical logs. As a result, the conventional method finds productive intervals crisply. In this investigation, a net index value is proposed, then; diffusivity equa...
User Profiling: Web Usage Mining
Web usage mining differs from collaborative filtering in that we are not interested in explicitly discovering user profiles but rather usage profiles. When preprocessing a log file we do not concentrate on efficient identification of unique users but rather try to identify separate user sessions. These sessions are then used to form the so-called transactions (see [3]). In the followin...
Effect of Temporal Relationships in Associative Rule Mining for Web Log Data
The advent of web-based applications and services has created diverse and voluminous web log data stored in web servers, proxy servers, client machines, or organizational databases. This paper investigates the effect of the temporal attribute in relational rule mining for web log data. We incorporated the characteristics of time in the rule mining process and analysed the effect of ...
Publication date: 2002